Introduction

I chose the “Medical Insurance” dataset. This dataset primarily records the different characteristics among members of health insurance companies and the premiums they currently pay. We know that in order to make a profit, insurance companies should collect higher premium than the amount paid to the insured person, so how to charge different fees for different types of customers is a very critical issue for insurance companies. I want to use the “insurance” dataset to explore which behaviors and performances of insurance company members have a significant impact on their premiums, so as to help insurance companies formulate the best pricing strategy.

The main methods used in this study:

After analyzing the relationship between each variable and between each variable and predictor, I decided to use linear regression to predict charges. Then, there were two models created with different numbers of variables. Moreover I evaluated the two models’ peformance and chose a better one. Finally, I created a prediction function based on the regression results and applied it to possible cases.

Library and Import

library(tidyverse)
library(MLmetrics)
library(Hmisc)
library(ggplot2)
library(ggthemes)
library(PerformanceAnalytics)
library(readr)
insurance <- read_csv("~/Desktop/insurance.csv")

Understand the data

head(insurance,5)
## # A tibble: 5 × 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
str(insurance)
## spec_tbl_df [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr [1:1338] "female" "male" "male" "male" ...
##  $ bmi     : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
##  $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr [1:1338] "yes" "no" "no" "no" ...
##  $ region  : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   bmi = col_double(),
##   ..   children = col_double(),
##   ..   smoker = col_character(),
##   ..   region = col_character(),
##   ..   charges = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
describe(insurance)
## insurance 
## 
##  7  Variables      1338  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1338        0       47    0.999    39.21    16.21       18       19 
##      .25      .50      .75      .90      .95 
##       27       39       51       59       62 
## 
## lowest : 18 19 20 21 22, highest: 60 61 62 63 64
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##     1338        0        2 
##                         
## Value      female   male
## Frequency     662    676
## Proportion  0.495  0.505
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1338        0      548        1    30.66    6.893    21.26    22.99 
##      .25      .50      .75      .90      .95 
##    26.30    30.40    34.69    38.62    41.11 
## 
## lowest : 15.960 16.815 17.195 17.290 17.385, highest: 48.070 49.060 50.380 52.580 53.130
## --------------------------------------------------------------------------------
## children 
##        n  missing distinct     Info     Mean      Gmd 
##     1338        0        6    0.899    1.095    1.275 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency    574   324   240   157    25    18
## Proportion 0.429 0.242 0.179 0.117 0.019 0.013
## --------------------------------------------------------------------------------
## smoker 
##        n  missing distinct 
##     1338        0        2 
##                       
## Value         no   yes
## Frequency   1064   274
## Proportion 0.795 0.205
## --------------------------------------------------------------------------------
## region 
##        n  missing distinct 
##     1338        0        4 
##                                                   
## Value      northeast northwest southeast southwest
## Frequency        324       325       364       325
## Proportion     0.242     0.243     0.272     0.243
## --------------------------------------------------------------------------------
## charges 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1338        0     1337        1    13270    12301     1758     2347 
##      .25      .50      .75      .90      .95 
##     4740     9382    16640    34832    41182 
## 
## lowest :  1121.874  1131.507  1135.941  1136.399  1137.011
## highest: 55135.402 58571.074 60021.399 62592.873 63770.428
## --------------------------------------------------------------------------------
# identify duplicate value
insurance[duplicated(insurance), ]
## # A tibble: 1 × 7
##     age sex     bmi children smoker region    charges
##   <dbl> <chr> <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 male   30.6        0 no     northwest   1640.
insurance <- insurance %>% 
  distinct()

From the above information and combine with the backgroud of this dataset, we can get:

There are 7 variables, 1338 observations in this dataset, and there is no any missing values, only one duplicate value.

  • Age: insurance contractor age; from 18 to 64 years old; the average age is 39; continous variable.

  • Sex: insurance contractor gender; including female and male; the proportions of them are almost evenly distributed; categorical variable.

  • BMI: body mass index, a value calculated from a person’s mass (weight) and height; from 15.96 to 53.13; the average bmi is 30.66; continous variable.

  • Children: numbers of children insurance contractor has; from 0 to 5; most contractors have no more than 2 children; continous variable.

  • Smoker: show insurance contractor smoking or not; including yes and no; non-smokers are four times as likely as smokers; categorical variable.

  • Region: show insurance contractor’s living area in the US; including four areas-northeast, southeast, southwest, northwest; the proportions of them are almost evenly distributed; categorical variable.

  • Charges: Expenses to be paid by insurance contractor as set by insurance company; from 1121 to 63770; the average charges is 13270; continous variable and predict value.

Exploratory Data Analysis and Visualization

Age

age_charges <- ggplot(insurance, aes(age, charges)) +
  geom_jitter(color = 'green', alpha = 0.5) + theme_clean() + ggtitle("Plot of Charges by Age") 
age_charges

We can see that as the age increases, the charges also increase accordingly. At the same time, there are three clear linear positive correlation lines in this plot, but mainly concentrated on the bottom line.

BMI

BMI_charges <- ggplot(insurance, aes(bmi, charges)) +
  geom_jitter(color = "orange", alpha = 0.5) + theme_clean() + ggtitle("Plot of Charges by BMI") 
BMI_charges

Overall, charges increased with BMI. However, the trend was not as clear as age, so I decide to look more details of BMI.

Modified BMI

After I did some research, I found that BMI could be classified into the following categories

  • underweight: bmi<18.5;
  • healthy weight: 18.5<bmi<25;
  • overweight : 25=<bmi<30;
  • obesity: bmi>=30
insurance<- insurance %>% 
  mutate(bmi_cat = case_when(
    bmi < 18.5 ~ "underweight",
    bmi < 25 ~ "healthy weight",
    bmi < 30 ~ "overweight",
    bmi >= 30 ~ "obesity"
  ))

insurance$bmi_cat = factor(insurance$bmi_cat, 
                    levels = c("underweight", 
                               "healthy weight",
                               "overweight",
                               "obesity"))
insurance %>% 
  ggplot(aes(x = bmi, y = charges, color = bmi_cat)) +
  geom_point() +
  labs(title = "Plot Charges By BMI_cat",
       x = "Bmi",
       y = "Charges ($)") +
  facet_wrap(~bmi_cat)

insurance %>% 
  ggplot(aes(x = age, y = charges, color = bmi_cat)) +
  geom_point() +
  labs(title = "Plot Charges By Age with BMI_cat",
       x = "Age",
       y = "Charges ($)") +
  facet_wrap(~bmi_cat)

From the above plots, we can see that people with a bmi over 30 (obesity), especially order people, have higher charges than those with a bmi below 30.

Children

children_charges <- ggplot(insurance, aes(children, charges)) +
  geom_jitter(aes(color = children), alpha = 0.5) + theme_clean() + ggtitle("Plot of Charges by Children") 
children_charges

From this plot, we get that in families with the number of children from 0 to 5, the charges for clients with five children is lower.

Gender

gender_charges <- ggplot(insurance,aes(sex,charges)) + geom_boxplot(fill = c("red", "blue")) +
  theme_clean() + ggtitle("Plot of Charges by Gender")
gender_charges

Charges barely differ by gender.

Smoker

smoker_charges <- ggplot(insurance,aes(smoker,charges)) + geom_boxplot(fill = c("green", "red")) +
  theme_clean() + ggtitle("Plot of Charges by Smoke Status")
smoker_charges

The result, as I expected, showed that it was more likely have huge charges for smokers than non-smokers.

Region

region_charges <- ggplot(insurance,aes(region,charges)) + geom_boxplot(fill = c(5:8)) +
  theme_clean() + ggtitle("Plot of Charges per Region")
region_charges

There is little difference in charges in the four regions above.

Overall,by analyzing each of the above variables one by one, I initially found the much effect of age, BMI, smokers and number of children on charges. And smoker seems to be a very essential factor here.

So now, I would like to check how charges varies by Age, BMI, and number of Children according to smoker.

p1 <- ggplot(data = insurance, aes_string(x = 'age', y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) + 
    geom_jitter() + 
    geom_smooth(formula = y ~ x, method = "lm") +
    ggtitle(glue::glue("Charges vs Age "))
p1

p2 <- ggplot(data = insurance, aes_string(x = 'bmi', y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) + 
    geom_jitter() + 
    geom_smooth(formula = y ~ x, method = "lm") +
    ggtitle(glue::glue("Charges vs BMI "))
p2

p3 <- ggplot(data = insurance, aes_string(x = 'children', y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) + 
    geom_jitter() + 
    geom_smooth(formula = y ~ x, method = "lm") +
    ggtitle(glue::glue("Charges vs Children "))
p3

we can see that even under the influence of different variables, the charges of smokers is always higher than that of non-smokers

Now, I am going to run the chart.Correlation to get an overview of correlation among variables and between each variable and charges.

Overview of variables correlation

insurance$sex[insurance$sex == "male"] <- 1
insurance$sex[insurance$sex == "female"] <- 0

insurance$smoker[insurance$smoker == "yes"] <- 1
insurance$smoker[insurance$smoker == "no"] <- 0

insurance<- insurance %>%
  mutate(smoker = as.numeric(smoker)) %>%
  mutate(sex = as.numeric(sex))

df <- insurance[, c(1,2,3,4,5,7)]
chart.Correlation(df, histogram=TRUE, pch=19)

From this plot, we can see that smoker has the highest correlation with charges. At the same time, we can see that the relationship between Age and Bmi and the relationship between Sex and Smoker are related. It makes sense. Generally speaking, older people are more likely to have a high BMI, and men are more likely to be smokers than women. But in overall, except for the predictor of charges, the correlation coefficient between the variables is very small.

Linear Regression Model

splitting the data into train and test

set.seed(1)
row.number <- sample(1:nrow(insurance), round(0.8*nrow(insurance)))
train = insurance[row.number,]
test = insurance[-row.number,]
dim(train)
## [1] 1070    8
dim(test)
## [1] 267   8

Running the first model (include all the variables)

mod1 <- lm(charges ~ age + bmi_cat + children + smoker + region + sex, train)
prediction1 <- predict(mod1,test)
summary(mod1)
## 
## Call:
## lm(formula = charges ~ age + bmi_cat + children + smoker + region + 
##     sex, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12081.9  -3705.3   -299.5   1762.8  27502.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -6122.07    1588.26  -3.855 0.000123 ***
## age                     266.56      13.46  19.799  < 2e-16 ***
## bmi_cathealthy weight  1255.89    1558.63   0.806 0.420558    
## bmi_catoverweight      2366.89    1535.70   1.541 0.123556    
## bmi_catobesity         6225.18    1524.39   4.084 4.77e-05 ***
## children                473.10     157.38   3.006 0.002709 ** 
## smoker                23481.90     461.80  50.849  < 2e-16 ***
## regionnorthwest        -701.90     540.75  -1.298 0.194560    
## regionsoutheast        -451.47     530.27  -0.851 0.394745    
## regionsouthwest       -1000.80     535.95  -1.867 0.062129 .  
## sex                     -38.37     376.45  -0.102 0.918825    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6126 on 1059 degrees of freedom
## Multiple R-squared:  0.7494, Adjusted R-squared:  0.747 
## F-statistic: 316.6 on 10 and 1059 DF,  p-value: < 2.2e-16
  • From the first model we built, firstly, we can see that F(324.4) is far greater than 1. It can be concluded that there is a relationship between variables and predictor(charges) .

  • And as with our previous observations, Age, Bmi(obesity), Children and Smokers are the significant variables (have less p value) for charges.

  • Then, based on the Adjusted r-squared, we got 0.7516, which is close to 1. It indicates that our first model explains the 75.16% of the variation of charges and hence a good fit.

Running the second model (only include the significant variables)

mod2<- lm(charges ~ age + bmi_cat + children + smoker, train)
prediction2 <- predict(mod2, test)
summary(mod2)
## 
## Call:
## lm(formula = charges ~ age + bmi_cat + children + smoker, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12455.9  -3741.9   -290.8   1654.4  27551.1 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -6570.69    1559.02  -4.215 2.71e-05 ***
## age                     267.01      13.44  19.861  < 2e-16 ***
## bmi_cathealthy weight  1201.60    1554.71   0.773   0.4398    
## bmi_catoverweight      2223.92    1529.48   1.454   0.1462    
## bmi_catobesity         6097.65    1513.28   4.029 5.99e-05 ***
## children                462.88     157.19   2.945   0.0033 ** 
## smoker                23499.25     459.46  51.145  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6126 on 1063 degrees of freedom
## Multiple R-squared:  0.7485, Adjusted R-squared:  0.7471 
## F-statistic: 527.2 on 6 and 1063 DF,  p-value: < 2.2e-16
mae1 <- MAE(prediction1, test$charges)
rmse1 <- RMSE(prediction1, test$charges)
mod1_reg <- cbind("MAE" = mae1, "RMSE" = rmse1)
mod1_reg
##          MAE     RMSE
## [1,] 3976.93 5621.011
mae2 <- MAE(prediction2, test$charges)
rmse2 <- RMSE(prediction2, test$charges)
mod2_reg <- cbind("MAE" = mae2, "RMSE" = rmse2)
mod2_reg
##          MAE     RMSE
## [1,] 3966.14 5613.493
  • From this new model, we can see that F(541.5) is far greater than 1 and this value is more than the F value(324.4) of the previous model. It makes sense, because we removed the less significant feature.

  • The relationship between the number of children and charges is relatively insignificant in this new model compared to other variables, but Children still has a low P value.

  • For the Adjusted r-squared of model 2, we can see it a little bit higher but also quite similar to the first model’s.

  • Then compare the RMSE of the two models, we got that RMSE2(6479.594) is higher than RMSE1(6459.722). It means model2 has a relatively larger gap between the predicted and observed values.

Overall, I would say there is no big differences between the above two models, and I am going to take model 2 as our final regression model since it only has four observed values, which is more easier to execute following predict work.

Prediction vs. Actual values of the selected model (model 2)

prediction<- predict(mod2, test)
ggplot(test, aes(x = prediction, y = charges)) + geom_point(color = "black", alpha = 0.5) + geom_abline(color = "red") +
  ggtitle("Prediction vs. Actual values")

Because each data point is quite close to the projected regression line, especially within 20000 charges, we may conclude that our model 2 fits the data reasonably well.

Prediction Function

Based on the model2, the equation will be like: charges = -6305.54 +265.39 (age) + 5591.39(bmi_obesity) + 461.11(children) + 23422.51(smoker).

predict_charge <- function(age, bmi, children, smoker){
    if(smoker == "yes"  && bmi < 30){
        result =  -6305.54 + (265.39*age) + (461.11*children) + 23422.51
    } else if(smoker == "yes" && bmi >= 30) {
        result =  -6305.54 + (265.39*age) + (461.11*children) + (5591.39*bmi) + 23422.51
    } else if(smoker == "no" && bmi < 30){
        result = -6305.54 + (265.39*age) + (461.11*children)
    } else if(smoker == "no" &&  bmi >= 30){
        result =  -6305.54 + (265.39*age) + (461.11*children) + (5591.39*bmi)
    } 
    return(result)
}

Case study

Tom:35 years old; bmi is 34.4; 0 child; smoke.

We can get his charge supposed to be:

predict_charge(35,34.4,0,"yes")
## [1] 218749.4

Conclusion

After analysis I found that smoking is the most important factor in increasing premiums.

At the same time, with the increase of age and BMI, the premium should also increase accordingly, especially when the insurer’s BMI exceeds 30 (obesity).

Therefore, we are not able to change our age for sure, but we can try to keep fit and avoid smoking. That’s good for your wallet and your health;)